DETAILED ACTION

Continued Examination Under 37 CFR 1.114 

1. A request for continued examination under 37 CFR 1.114, including the fee set
forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this
application is eligible for continued examination under 37 CFR 1.114, and the fee set
forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action
has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 
12/8/2009 has been entered. 

Response to Amendment 

2. Claims 1-19 are pending. 

3. Claims 12, 13, 17, and 18 have been amended. 

4. The objection to claim 13 has been withdrawn in view of the amendments
received 12/8/2009. 

5. The 35 USC 112, 2nd paragraph rejections have been withdrawn in view of the
amendments received 12/8/2009. 

Response to Arguments 

6. Applicant argues "Cardillo does not disclose "receiving input from a user
identifying at least two portions of a first set of audio signals as being of interest to the
user." At most, Cardillo teaches receiving an indication that a text-based single- or
multi-word term is of interest to a user." (Remarks, Page 8, ¶ 3)

The Examiner disagrees. A multi-word text query would teach the given 
limitation. The input is received from a user and identifies at least two portions of a first 
set of audio signals as being of interest to the user. The broadest reasonable 
interpretation of the identification of at least two portions can be a multi-word text query. 
The argument is not persuasive. 

7. Applicant further argues "Further, Cardillo does not disclose "processing... each 
identified portion of the first set of audio signals to generate a corresponding subword 
unit representation of the identified portion; [and] forming..., a representation of a spoken
event of interest, wherein the forming includes combining the subword unit 
representations of the respective identified portions of the first set of audio signals." At 
most, Cardillo teaches forming a representation of a text-based query by probing a 
phonetic dictionary and/or consulting a spelling-to-sound database. See, e.g., page 12 
of Cardillo: "A phonetic dictionary is probed for each word within the query term to 
accommodate unusual terms (whose pronunciations must be handled specially for the 
given natural language) as well as very common words (for whom performance 
optimization is worthwhile). Any word not found in the dictionary is then processed by 
consulting a spelling-to-sound data base to extract likely phonetic representations given 
the word's orthography." (Remarks, Pages 8-9) 

In response to applicant's argument that the references fail to show certain 
features of applicant's invention, it is noted that the features upon which applicant relies 
(i.e., ...combining the subword unit representations...) are not recited in the previously 
rejected claim(s). Although the claims are interpreted in light of the specification,
limitations from the specification are not read into the claims. See In re Van Geuns, 988
F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). However, since the same references are
applied, the Examiner will respond accordingly. Wolf teaches that an input spoken query
is reduced to a phoneme lattice (see Fig. 1 and ¶ 0022). Since the phoneme lattice is
a combination of phonemic interpretations of the utterance, it teaches combining the
subword unit representations of the respective identified portions of the first set of audio 
signals. The argument is not persuasive. 

8. Applicant further argues "Even if, for the sake of argument only, Wolf's "spoken
query" is read as corresponding to the recited "spoken event of interest," and even if the 
audio signals corresponding to the spoken query that is provided to the speech 
recognition engine is read as corresponding to the recited "first set of audio signals," no 
portion of Wolf provides any hint or disclosure of "receiving input from a user identifying 
at least two portions of a first set of audio signals as being of interest to the user," much 
less "processing... each identified portion of the first set of audio signals to generate a 
corresponding subword unit representation of the identified portion; [and] forming..., a
representation of a spoken event of interest, wherein the forming includes combining
the subword unit representations of the respective identified portions of the first set of
audio signals," as required in amended claim 1." (Remarks, Page 9, ¶ 3)

In response to applicant's arguments against the references individually, one 
cannot show nonobviousness by attacking references individually where the rejections 
are based on combinations of references. See In re Keller, 642 F.2d 413, 208 



Application/Control Number: 10/565,570 Page 5 

Art Unit: 2626 

USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 
1986). Both Wolf and Cardillo reduce the query (text/speech) to a subword unit 
representation. Wolf then goes on to further teach the phoneme lattice as shown in the 
above response. The argument is not persuasive. 

Information Disclosure Statement 

9. The Information Disclosure Statement (IDS) submitted on 12/8/2009 is not in 
compliance with the provisions of 37 CFR 1.97. The Charlesworth reference is
incorrectly numbered as 6873992 when it should be 6873993. However, to correct this, 
the Examiner will include it on the 892 to put the reference on record. 

Claim Rejections - 35 USC § 101 

35 U.S.C. 101 reads as follows: 

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of 
matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the 
conditions and requirements of this title. 

10. Claim(s) 1 is rejected under 35 USC 101 for being nonstatutory. Under the most 
recent interpretation of the Interim Guidelines regarding 35 U.S.C. 101, a method claim 
must (1) be tied to another statutory class or (2) transform underlying subject matter to a
different state or thing. If no transformation occurs, the claim(s) should positively recite 
the other statutory class to which it is tied to qualify as a statutory process under 35 
U.S.C. 101. As for guidance to areas of statutory subject matter, see 35 U.S.C. 101 
Interim Guidelines (with emphasis of the Clarification of "processes" under 35 USC 
101). As an example, the claim(s) could identify the apparatus that accomplishes the
method steps, or positively recite the subject matter that is being transformed. After
further consideration by the Examiner, the system (word spotting system) which was
used to satisfy the machine prong of the machine-or-transformation test can be entirely
software (see ¶ 0059, ...Alternative systems that implement the techniques described
above can be implemented in software...).

Claim Rejections - 35 USC § 103 

The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

11. Claims 1-4, 8-9, 12-19 are rejected under 35 U.S.C. 103(a) as being
unpatentable over Cardillo et al. (NPL Document "Phonetic Searching vs. LVCSR: How
to Find What You Really Want in Audio Archives") in view of Wolf et al. (US PGPUB
#20030204492).

As per claim 1, Cardillo teaches:

receiving input from a user identifying at least two portions of a first set of audio 
signals as being of interest to the user; (Page 12, column 1, ...words or
phrases... a phrase would include at least two portions (words); also, the query
representations can be phonetic and would thus also teach phonetic portions.)

processing, by a query recognizer of a word spotting system, each identified 
portion of the first set of audio signals to generate a corresponding subword unit 
representation of the identified portion; (Page 12, column 1, ...words or
phrases... a phrase would include at least two portions (words); also, the query
representations can be phonetic and would thus also teach phonetic portions. Further,
...A phonetic dictionary is probed for each word within the query term...)

accepting, by a word spotting engine of the word spotting system, data 
representing the unknown speech in a second audio signal; and (Fig. 1,
preprocessing phase and generation of the search track. Page 11, column 2, ...A 
phonetic grammar likewise depends upon the natural language in use (particularly the 
set of phonemes used to represent basic sounds and meanings of the input speech). 
This grammar is used to identify likely end points of words in the input speech...) 

Cardillo fails to fully teach, but Wolf teaches: 

forming, by the query recognizer of a word spotting system, a representation of a 
spoken event of interest, wherein the forming includes combining the subword unit 
representations of the respective identified portions of the first set of audio signals; 
(Cardillo, Page 15, column 2, teaches multiple word queries which are two or more
instances of an event of interest. They are performed in a phonetic search; therefore they
comprise at least one sequence of phonemes each. Again, Cardillo fails to specifically
teach a spoken event of interest. Wolf, ¶ 0054, ...a phoneme lattice, as described
above, can also be used for devices with limited resources... In the case where the
recognizer is part of the input device, e.g., a cell phone, the lattices can be forwarded to
the search engine 190... Furthermore, ¶ 0022, ...If phonemes are used, then it is
possible to handle words that sound the same but have different meaning... A phoneme
lattice (combination) is used to represent multiple pronunciations of words.)

locating, by the word spotting engine of the word spotting system, putative 
instances of the spoken event of interest in the second audio signal using the 
representation of the spoken event of interest, wherein the locating includes identifying 
time locations of the second audio signal at which the spoken event of interest is likely 
to have occurred based on a comparison of the data representing the unknown speech 
with the representation of the spoken event of interest. (Cardillo teaches in
Fig. 1 a searching phase which locates putative instances of a query, where, on Page 12,
Temporal_Offset teaches a time location. Cardillo fails to specifically teach a spoken
event of interest (spoken query). Wolf, abstract, ...A spoken query is represented as a
lattice...)

It would have been obvious to someone of ordinary skill in the art to combine 
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a 
speech recognition process and performing traditional textual document retrieval. (Wolf, 
¶ 0006)

As per claim 2, claim 1 is incorporated and Cardillo teaches: 

wherein processing each identified portion of the first set of audio signals 
comprises applying a computer-implemented speech recognition algorithm to data
representing the first set of audio signals. (Page 11, Preprocessing,
Acoustic Model, and Phonetic Grammar, ...preprocessing engine scans the input
speech and produces the corresponding phonetic search track...)

As per claim 3, claim 1 is incorporated and Cardillo teaches: 

wherein the subword units include linguistic units. (Page 11, Preprocessing,
Acoustic Model, and Phonetic Grammar, ...preprocessing engine scans the input 
speech and produces the corresponding phonetic search track...) 

As per claim 4, claim 2 is incorporated and Cardillo fails to fully teach but Wolf teaches: 

wherein locating the putative instances includes applying a computer-implemented
word spotting algorithm configured using the representation of the spoken
event of interest. (Cardillo teaches phonetic keyword spotting but fails to teach
that the event of interest (query) is spoken. Wolf, ¶ 0054, ...a phoneme lattice, as
described above, can also be used for devices with limited resources... In the case
where the recognizer is part of the input device, e.g., a cell phone, the lattices can be
forwarded to the search engine 190...)

It would have been obvious to someone of ordinary skill in the art to combine 
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a 
speech recognition process and performing traditional textual document retrieval. (Wolf, 
¶ 0006)

As per claim 8, claim 1 is incorporated and Cardillo fails to fully teach but Wolf teaches:

wherein the representation of the spoken event of interest defines a network of 
subword units. (Wolf, ¶ 0054, ...a phoneme lattice, as described above, can
also be used for devices with limited resources... In the case where the recognizer is
part of the input device, e.g., a cell phone, the lattices can be forwarded to the search
engine 190...)

It would have been obvious to someone of ordinary skill in the art to combine 
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a 
speech recognition process and performing traditional textual document retrieval. (Wolf, 
¶ 0006)

As per claim 9, claim 8 is incorporated and Cardillo fails to fully teach but Wolf teaches: 

wherein the network of subword units is formed by multiple sequences of 
subword units that correspond to different paths through the network. (Wolf, ¶
0054, ...a phoneme lattice, as described above, can also be used for devices with
limited resources... In the case where the recognizer is part of the input device, e.g., a 
cell phone, the lattices can be forwarded to the search engine 190... Figs 3a and 3b 
show the lattices.) 

It would have been obvious to someone of ordinary skill in the art to combine 
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a 
speech recognition process and performing traditional textual document retrieval. (Wolf, 
¶ 0006)

As per claim 12, claim 1 is incorporated and Cardillo fails to fully teach but Wolf
teaches:

accepting first audio data representing utterances of the event of interest spoken 
by a user, and processing the first audio data to form a processed query. 
(Wolf, ¶ 0054, ...a phoneme lattice, as described above, can also be used for devices
with limited resources... In the case where the recognizer is part of the input device,
e.g., a cell phone, the lattices can be forwarded to the search engine 190... Figs 3a and 
3b show the lattices.) 

It would have been obvious to someone of ordinary skill in the art to combine 
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a 
speech recognition process and performing traditional textual document retrieval. (Wolf, 
¶ 0006)

As per claim 13, claim 1 is incorporated and Cardillo teaches: 

accepting a selection by the user of portions of stored data from the first set of 
audio signals, and processing the portions of the stored data to form a processed query. 
(Cardillo, Page 14, teaches the use of the AudioLogger™ product for selecting media 
segments. Although the capabilities of AudioLogger™ are not discussed in Cardillo, 
NPL Document "Inventory of Metadata for Multimedia" teaches in section 3.4 that ...The 
Virage Tools automatically index your video content. The tools use speech recognition 
technology to extract information from the audio signal... Therefore, the selection is an 
inherent feature of AudioLogger™, which Cardillo incorporates by reference.)

As per claim 14, claim 13 is incorporated and Cardillo teaches:

prior to accepting the selection by the user, processing the first set of audio 
signals according to a first computer-implemented speech recognition algorithm to 
produce the stored data. (Cardillo, Page 14, teaches the use of the
AudioLogger™ product for selecting media segments. Although the capabilities of
AudioLogger™ are not discussed in Cardillo, NPL Document "Inventory of Metadata for
Multimedia" teaches in section 3.4 that ...The Virage Tools automatically index your
video content. The tools use speech recognition technology to extract information from
the audio signal. A publishing application makes it possible to publish the searchable
information... Therefore, prior to the selection of the searchable information, a speech
recognition algorithm is performed to produce the stored searchable data. The
selection is an inherent feature of AudioLogger™, which Cardillo incorporates by
reference.)

As per claim 15, claim 14 is incorporated and Cardillo teaches: 

wherein the first speech recognition algorithm produces data related to presence 
of the subword units at different times in the first set of audio signals. (Cardillo
teaches in Fig. 1 a searching phase which locates putative instances of a query, where,
on Page 12, Temporal_Offset teaches a time location. Thus the collection of subword units
generated by the first speech recognition algorithm is indicative of subword units at
different times in the first set of audio signals.)

As per claim 16, claim 14 is incorporated and Cardillo teaches:

applying a second speech recognition algorithm to the processed query. 
(Cardillo, Page 11, column 2, ...It is even conceivable that dynamic channel
and language detection could be employed to switch acoustic models during
preprocessing... Furthermore, Page 12, column 1, ...A phonetic dictionary is probed for
each word within the query term to accommodate unusual words (whose pronunciations
must be handled specially for the given natural language)... It would have been obvious
to someone of ordinary skill in the art at the time of the invention that if the audio is 
indexed in an alternative language and a person wants to search in a different language 
(North American English vs. Castilian Spanish), a second speech recognition algorithm 
with a phonetic dictionary specific to the language could have been applied to the 
processed query.) 

Claims 17 and 18 are the computer readable medium and hardware 
representations of the method as claimed in claim 1. Claims 17 and 18 are rejected
under the same principles as claim 1 for having similar limitations. Cardillo, Page 20, 
table 3 teaches computer implemented means for the method. The computer teaches a 
system, and the system cannot operate without being programmed, so a computer 
readable medium is inherent. 

As per claim 19, claim 18 is incorporated and Cardillo fails to specifically teach, but Wolf 
teaches:

wherein the word spotter is further configured to identify time locations of the
second audio signal at which the spoken event of interest is likely to have occurred 
based on a comparison of the data representing the unknown speech with the 
representation of the spoken event of interest. (Cardillo teaches in Fig. 1 a
searching phase which locates putative instances of a query, where, on Page 12,
Temporal_Offset teaches a time location. Cardillo fails to specifically teach a spoken
event of interest (spoken query). Wolf, abstract, ...A spoken query is represented as a
lattice...)

It would have been obvious to someone of ordinary skill in the art to combine 
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a 
speech recognition process and performing traditional textual document retrieval. (Wolf, 
¶ 0006)

12. Claims 5-7, 10-11 are rejected under 35 U.S.C. 103(a) as being unpatentable
over Cardillo et al. (NPL Document "Phonetic Searching vs. LVCSR: How to Find What
You Really Want in Audio Archives") in view of Wolf et al. (US PGPUB #20030204492),
and further in view of Ferrieux et al. (NPL Document "PHONEME-LEVEL INDEXING
FOR FAST AND VOCABULARY-INDEPENDENT VOICE/VOICE RETRIEVAL").

As per claim 5, claim 4 is incorporated and Cardillo and Wolf fail to teach, but Ferrieux 
teaches: 

selecting processing parameter values of the speech recognition algorithm for 
application to the data representing the first set of audio signals according to
characteristics of the word spotting algorithm. (Section 2.2, ...To compute the
optimal parameters of those models,...)

It would have been obvious to someone of ordinary skill in the art at the time of 
the invention to combine Ferrieux with Cardillo and Wolf to optimize the parameters of 
the system based on the desired outcome (maximize likelihood/minimize error retrieval 
rate). (Ferrieux, Section 2.2)

As per claim 6, claim 5 is incorporated and Cardillo and Wolf fail to teach, but Ferrieux 
teaches: 

wherein the selecting of the processing parameter values of the speech 
recognition algorithm includes optimizing said parameters according to an accuracy of 
the word spotting algorithm. (Section 2.2, ...To compute the optimal
parameters of those models, ...maximize likelihood...)

It would have been obvious to someone of ordinary skill in the art at the time of 
the invention to combine Ferrieux with Cardillo and Wolf to optimize the parameters of 
the system based on the desired outcome (maximize likelihood/minimize error retrieval 
rate). (Ferrieux, Section 2.2)

As per claim 7, claim 5 is incorporated and Cardillo and Wolf fail to teach, but Ferrieux 
teaches: 

wherein the selecting of the processing parameter values of the speech 
recognition algorithm includes selecting values for parameters including one or more of
an insertion factor, a recognition search beam width, a recognition grammar factor, and
a number of recognition hypotheses. (Ferrieux, section 2.1 defines the models
which include insertion costs; section 2.2 optimizes them.)

It would have been obvious to someone of ordinary skill in the art at the time of 
the invention to combine Ferrieux with Cardillo and Wolf to optimize the parameters of 
the system based on the desired outcome (maximize likelihood/minimize error retrieval 
rate). (Ferrieux, Section 2.2)

As per claim 10, claim 1 is incorporated and Cardillo and Wolf fail to teach, but Ferrieux 
suggests: 

wherein forming the representation of the spoken event of interest includes 
determining an n-best list of recognition results. (Section 3, Synchronized 
Lattices, ...for each phoneme of the 1-best sequence, the posterior probabilities of the
N-1 other phonemes on the same interval are also given... Ferrieux further provides 
...the single best sequence of phonemes often contains errors, and although systematic 
errors are handled by the confusion matrix, it is felt that knowledge of the second-best 
match would most often yield greater precision... Therefore, given that the single best 
sequence of phonemes often contains errors, it would have been obvious to someone 
of ordinary skill in the art that multiple sequences through the lattice would have been 
beneficial to boost precision due to a higher number of sequences.) 

It would have been obvious to someone of ordinary skill in the art at the time of 
the invention to combine Ferrieux with Cardillo and Wolf to optimize the parameters of
the system based on the desired outcome (maximize likelihood/minimize error retrieval
rate). (Ferrieux, Section 2.2)

As per claim 11, claim 10 is incorporated and Cardillo and Wolf fail to teach, but
Ferrieux suggests: 

wherein each sequence of subword units in the representation corresponds to a 
different one in the n-best list of recognition results. (Section 3, Synchronized 
Lattices, ...for each phoneme of the 1-best sequence, the posterior probabilities of the
N-1 other phonemes on the same interval are also given... Ferrieux further provides 
...the single best sequence of phonemes often contains errors, and although systematic 
errors are handled by the confusion matrix, it is felt that knowledge of the second-best 
match would most often yield greater precision... Therefore, given that the single best
sequence of phonemes often contains errors, it would have been obvious to someone 
of ordinary skill in the art that multiple sequences through the lattice would have been 
beneficial to boost precision due to a higher number of sequences. The phoneme 
sequences are the matches (which defines the first and second options) in Ferrieux, 
therefore it would have further been obvious that the phoneme sequences through the 
lattice would have been each of the different options in the n-best criterion evaluations.) 

It would have been obvious to someone of ordinary skill in the art at the time of 
the invention to combine Ferrieux with Cardillo and Wolf to optimize the parameters of 
the system based on the desired outcome (maximize likelihood/minimize error retrieval 
rate). (Ferrieux, Section 2.2)

Conclusion

13. The prior art made of record and not relied upon is considered pertinent to 
applicant's disclosure. Refer to PTO-892, Notice of References Cited for a listing of 
analogous art. 

14. Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to GREG A. BORSETTI whose telephone number is 
(571)270-3885, (FAX: 571-270-4885). The examiner can normally be reached on 
Monday - Thursday (8am - 5pm Eastern Time). 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, RICHEMOND DORVIL can be reached on 571-272-7602. The fax phone 
number for the organization where this application or proceeding is assigned is
571-273-8300.

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information
system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 

/Greg A. Borsetti/ 
Examiner, Art Unit 2626 



/Talivaldis Ivars Smits/ 
Primary Examiner, Art Unit 2626 